Mixed-precision orthogonalization scheme and its case studies with CA-GMRES on a GPU
Abstract
We propose a mixed-precision orthogonalization scheme that takes the input matrix in standard 32- or 64-bit floating-point precision but accumulates its intermediate results in higher-precision arithmetic. For 64-bit input, our scheme emulates the higher-precision arithmetic in software; it requires about 20× more computation but about the same amount of communication as the standard orthogonalization scheme. Because computation is becoming less expensive relative to communication on new and emerging architectures, the relative cost of our mixed-precision scheme is decreasing. Our case studies with CA-GMRES on a GPU demonstrate that using mixed precision for this small but critical segment of CA-GMRES can improve not only its overall numerical stability but also, in some cases, its performance.
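To make the idea of software-emulated higher precision concrete, the following is a minimal sketch of a dot product whose running sum is carried as an unevaluated pair of doubles, built from the classical error-free transformations (Knuth's two-sum and Dekker's two-product). It is a compensated, Dot2-style accumulation written for illustration only, not the authors' GPU kernel; the function names `two_sum`, `split`, `two_prod`, `dot_naive`, and `dot_dd` are hypothetical. Dot products of this kind are the building block of the Gram-matrix computation in an orthogonalization step.

```python
from typing import Sequence, Tuple


def two_sum(a: float, b: float) -> Tuple[float, float]:
    """Error-free addition (Knuth): returns (s, e) with a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a: float) -> Tuple[float, float]:
    """Veltkamp split of a double into high and low halves (hi + lo == a)."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_prod(a: float, b: float) -> Tuple[float, float]:
    """Error-free product (Dekker): returns (p, e) with a * b == p + e,
    assuming no overflow or underflow occurs."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e


def dot_naive(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain double-precision dot product with a simple running sum."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc


def dot_dd(x: Sequence[float], y: Sequence[float]) -> float:
    """Dot product accumulated as a (hi, lo) pair, rounded to double at the end."""
    hi, lo = 0.0, 0.0
    for a, b in zip(x, y):
        p, err_prod = two_prod(a, b)     # exact product as p + err_prod
        hi, err_sum = two_sum(hi, p)     # exact sum as hi + err_sum
        lo += err_prod + err_sum         # collect the rounding errors
    return hi + lo


if __name__ == "__main__":
    x = [1e16, 1.0, -1e16]
    y = [1.0, 1.0, 1.0]
    print(dot_naive(x, y), dot_dd(x, y))
```

Each term costs many more floating-point operations than a plain multiply-add, which is in line with the abstract's point that the extra cost is computation rather than communication: the input vectors are read once either way.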
Similar resources
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for CA-GMRES on GPUs
We propose a mixed-precision orthogonalization scheme that takes the input matrix in standard 64-bit floating-point precision but accumulates its intermediate results in doubled precision. When the target hardware does not support the desired higher precision, we use software emulation. Compared with the standard orthogonalization scheme, we require about 8.5× more computation but a much...
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs
The Generalized Minimum Residual (GMRES) method is a popular Krylov subspace projection method for solving a nonsymmetric linear system of equations. On modern computers, communication is becoming increasingly expensive compared to arithmetic operations, and a communication-avoiding variant (CA-GMRES) may improve the performance of GMRES. To further enhance the performance of CA-GMRES, in this p...
Breakthroughs in Sparse Solvers for GPUs
The CUDA Center of Excellence (CCOE) at UTK targets the development of innovative algorithms and technologies to tackle challenges in Heterogeneous High Performance Computing. Over the last year, the CCOE at UTK developed CUDA-based breakthrough technologies in sparse solvers for GPUs. Here, we describe the main ones – a sparse iterative solvers package, a communication-avoiding (CA) sparse ite...
Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU
Singular Value QR (SVQR) can orthonormalize a set of dense vectors with the minimum communication (one global reduction between the parallel processing units, and BLAS-3 to perform most of its local computation). As a result, compared to other orthogonalization schemes, SVQR obtains superior performance on many of the current computers, where the communication has become significantly more expe...
Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs
To orthonormalize the columns of a dense matrix, the Cholesky QR (CholQR) requires only one global reduction between the parallel processing units and performs most of its computation using BLAS-3 kernels. As a result, compared to other orthogonalization algorithms, CholQR obtains superior performance on many of the current computer architectures, where the communication is becoming increasingl...